Rule Building and Tuning Guide
This guide provides practical advice for building, testing, and tuning Content Identification rules so that they detect sensitive content accurately with few false positives. Use it alongside the reference documentation to create effective detection rules.
Getting Started
Understanding Your Requirements
Before building rules, clearly define your detection objectives:
- What data to detect: Specific data types (SSN, credit cards, etc.)
- Where it appears: File types, applications, communication channels
- Accuracy requirements: Acceptable false positive/negative rates
- Performance constraints: Processing time and resource limitations
Planning Your Approach
- Start with existing rules: Review predefined rules for similar use cases
- Gather sample data: Collect representative content for testing
- Define success criteria: Set measurable goals for accuracy and performance
- Plan iterative development: Build, test, refine in cycles
Building Your First Rule
Step 1: Create a Basic Rule Pack
Start with a simple rule pack structure:
<?xml version="1.0" encoding="UTF-8"?>
<RulePackage xmlns="http://schemas.microsoft.com/office/2011/mce">
<RulePack id="my-custom-rules">
<Version major="1" minor="0" build="0" revision="0"/>
<Publisher id="my-organization"/>
<Details defaultLangCode="en">
<LocalizedDetails langcode="en">
<PublisherName>My Organization</PublisherName>
<Name>Custom Detection Rules</Name>
<Description>Custom rules for detecting sensitive data</Description>
</LocalizedDetails>
</Details>
<Rules>
<!-- Rules will go here -->
</Rules>
<Resources>
<!-- Shared resources will go here -->
</Resources>
</RulePack>
</RulePackage>
Step 2: Define Shared Resources
Create reusable resources for keywords and patterns:
<Resources>
<!-- Keywords for financial terms -->
<Keyword id="financial-keywords">
<Group matchStyle="word">
<Term>account</Term>
<Term>balance</Term>
<Term>payment</Term>
<Term>transaction</Term>
</Group>
</Keyword>
<!-- Pattern for account numbers -->
<Regex id="account-number-pattern">
<Pattern>\b\d{8,12}\b</Pattern>
</Regex>
</Resources>
Step 3: Create a Simple Detection Rule
Start with a basic entity rule:
<Rules>
<Entity id="bank-account-detection" patternsProximity="300" recommendedConfidence="75">
<Pattern confidenceLevel="85">
<IdMatch idRef="account-number-pattern"/>
<Match idRef="financial-keywords"/>
</Pattern>
</Entity>
</Rules>
Testing and Validation
Creating Test Content
Develop comprehensive test content that includes:
- Positive samples: Content that should match your rules
- Negative samples: Similar content that should not match
- Edge cases: Boundary conditions and unusual formats
- Real-world samples: Actual content from your environment
Test Content Examples
Positive Test Cases:
Account number: 123456789
Payment to account 987654321
Transaction for account #555666777
Negative Test Cases:
Phone number: 123456789
Order number: 987654321
Reference ID: 555666777
Edge Cases:
Account: 12345678 (minimum length)
Account: 123456789012 (maximum length)
Acct 123-456-789 (with formatting)
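The samples above can be checked against the Step 2 resources before a rule pack is loaded anywhere. The following is a minimal sketch in plain Python, not the Content Identification engine itself: it reuses the account-number-pattern regex and the financial-keywords terms, and it assumes that matchStyle="word" behaves like a case-insensitive whole-word match. The helper name classify and the expected values are illustrative only, and proximity is ignored because each sample is a single short line.

```python
import re

# Regex and terms copied from the Step 2 resources.
ACCOUNT_NUMBER = re.compile(r"\b\d{8,12}\b")
FINANCIAL_KEYWORDS = ["account", "balance", "payment", "transaction"]
# Assumption: matchStyle="word" acts like a case-insensitive whole-word match.
KEYWORD = re.compile(r"\b(?:" + "|".join(FINANCIAL_KEYWORDS) + r")\b", re.IGNORECASE)

def classify(text):
    """Return True when the text has both an account-number match and a keyword."""
    return bool(ACCOUNT_NUMBER.search(text)) and bool(KEYWORD.search(text))

# Each sample maps to the result the Step 2 resources produce as written.
samples = {
    "Account number: 123456789": True,
    "Payment to account 987654321": True,
    "Transaction for account #555666777": True,
    "Phone number: 123456789": False,
    "Order number: 987654321": False,
    "Reference ID: 555666777": False,
    "Account: 12345678 (minimum length)": True,
    "Account: 123456789012 (maximum length)": True,
    "Acct 123-456-789 (with formatting)": False,
}

for text, expected in samples.items():
    actual = classify(text)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{status:8} expected={expected!s:5} actual={actual!s:5} {text}")
```

Note that the formatted sample is not detected: the digits are split by hyphens and "Acct" is not in the keyword list. Whether that counts as a miss depends on your detection objectives.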
Testing Methodology
- Unit Testing: Test individual patterns and keywords
- Integration Testing: Test complete rules with all components
- Performance Testing: Measure processing time with large content
- Accuracy Testing: Calculate precision and recall metrics
Measuring Accuracy
Calculate key metrics to assess rule performance:
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
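As a worked illustration of the three formulas, the sketch below computes them from raw counts. The function name accuracy_metrics and the counts are hypothetical; substitute the totals from your own labeled test set.

```python
def accuracy_metrics(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when a test set has no matches.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts from a labeled test set.
precision, recall, f1 = accuracy_metrics(
    true_positives=90, false_positives=10, false_negatives=30
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.90 recall=0.75 f1=0.82
```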
Tuning for Better Accuracy
Reducing False Positives
Problem: Rules match unintended content
Solutions:
- Add Context Keywords: Require supporting evidence
<Pattern confidenceLevel="85">
<IdMatch idRef="number-pattern"/>
<Any minMatches="1">
<Match idRef="financial-keywords"/>
<Match idRef="banking-keywords"/>
</Any>
</Pattern>
- Use Exclusion Patterns: Filter out known false positives
<Pattern confidenceLevel="80">
<IdMatch idRef="ssn-pattern"/>
<Match idRef="personal-context"/>
<Not>
<Match idRef="test-data-keywords"/>
</Not>
</Pattern>
- Adjust Proximity Settings: Reduce distance between patterns
<Entity id="precise-detection" patternsProximity="150">
<!-- Patterns must be closer together -->
</Entity>
- Increase Confidence Thresholds: Require higher confidence
<Pattern confidenceLevel="90"> <!-- Increased from 75 -->
<IdMatch idRef="validated-pattern"/>
<Match idRef="strong-context"/>
</Pattern>
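To see how a smaller proximity value filters out weakly related matches, the sketch below approximates the check in plain Python. It assumes that proximity is counted in characters before the start and after the end of the ID match. The real engine may measure the window differently, and keyword_within_proximity is a hypothetical helper, not part of any rule pack API.

```python
import re

def keyword_within_proximity(text, id_pattern, keywords, proximity):
    """Return True if a keyword lies within `proximity` characters of an ID match.

    Assumption: the window extends `proximity` characters before the start
    and after the end of the ID match.
    """
    keyword_re = re.compile(
        r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    keyword_spans = [m.span() for m in keyword_re.finditer(text)]
    for id_match in re.finditer(id_pattern, text):
        low = id_match.start() - proximity
        high = id_match.end() + proximity
        # A keyword counts only if it falls entirely inside the window.
        if any(start >= low and end <= high for start, end in keyword_spans):
            return True
    return False

# Hypothetical content: keywords sit more than 200 characters before the number.
text = "Account balance summary. " + "x" * 200 + " 123456789"
keywords = ["account", "balance"]

print(keyword_within_proximity(text, r"\b\d{8,12}\b", keywords, 300))  # True
print(keyword_within_proximity(text, r"\b\d{8,12}\b", keywords, 150))  # False
```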
Reducing False Negatives
Problem: Rules miss legitimate sensitive content
Solutions:
- Add Alternative Patterns: Cover different formats
<Entity id="comprehensive-detection">
<Pattern confidenceLevel="90">
<IdMatch idRef="formatted-pattern"/>
<Match idRef="context-keywords"/>
</Pattern>
<Pattern confidenceLevel="75">
<IdMatch idRef="unformatted-pattern"/>
<Any minMatches="2">
<Match idRef="context-keywords"/>
<Match idRef="supporting-keywords"/>
</Any>
</Pattern>
</Entity>
- Expand Keyword Lists: Include synonyms and variations
<Keyword id="expanded-financial-terms">
<Group matchStyle="word">
<Term>account</Term>
<Term>acct</Term>
<Term>account number</Term>
<Term>account #</Term>
<Term>bank account</Term>
<Term>checking</Term>
<Term>savings</Term>
</Group>
</Keyword>
- Use Broader Patterns: Include more variations
<Regex id="flexible-ssn-pattern">
<Pattern>\b\d{3}[-.\s]?\d{2}[-.\s]?\d{4}\b</Pattern>
</Regex>
- Lower Confidence Thresholds: Accept lower confidence matches
<Pattern confidenceLevel="65"> <!-- Decreased from 75 -->
<IdMatch idRef="broad-pattern"/>
<Match idRef="weak-context"/>
</Pattern>
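Before broadening a pattern, it helps to confirm which formats it now covers. The sketch below compares the flexible-ssn-pattern regex shown above with a stricter hyphen-only variant, using Python's re module as a stand-in for the engine's regex support (an assumption: the two regex dialects may differ in details). The candidate values are made up for illustration.

```python
import re

STRICT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
FLEXIBLE = re.compile(r"\b\d{3}[-.\s]?\d{2}[-.\s]?\d{4}\b")

# Made-up values covering the separators the flexible pattern allows.
candidates = ["123-45-6789", "123 45 6789", "123.45.6789", "123456789", "123-456-789"]

for value in candidates:
    strict_hit = bool(STRICT.search(value))
    flexible_hit = bool(FLEXIBLE.search(value))
    print(f"{value:12} strict={strict_hit!s:5} flexible={flexible_hit}")
```

The flexible pattern accepts the first four values, including the bare nine-digit number, so it also matches any other nine-digit value. Pair a broader pattern with context keywords to keep false positives in check.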
Advanced Tuning Techniques
Multi-Pattern Rules
Create rules with multiple patterns for different scenarios:
<Entity id="credit-card-comprehensive" patternsProximity="300" recommendedConfidence="80">
<!-- High confidence: validated format with strong context -->
<Pattern confidenceLevel="95">
<IdMatch idRef="Func_credit_card_formatted"/>
<Any minMatches="1">
<Match idRef="credit-card-keywords"/>
<Match idRef="payment-keywords"/>
</Any>
</Pattern>
<!-- Medium confidence: pattern with multiple context clues -->
<Pattern confidenceLevel="80">
<IdMatch idRef="credit-card-regex"/>
<Any minMatches="2">
<Match idRef="credit-card-keywords"/>
<Match idRef="payment-keywords"/>
<Match idRef="financial-keywords"/>
</Any>
</Pattern>
<!-- Lower confidence: pattern with strong context -->
<Pattern confidenceLevel="70">
<IdMatch idRef="number-pattern"/>
<Any minMatches="3">
<Match idRef="visa-keywords"/>
<Match idRef="mastercard-keywords"/>
<Match idRef="payment-context"/>
<Match idRef="financial-context"/>
</Any>
</Pattern>
</Entity>
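The tiers in this entity can be reasoned about with a small model. The sketch below assumes that an entity reports the highest confidence level among its matched patterns and that a match counts only when that level reaches recommendedConfidence. Both are assumptions about engine behavior, so confirm them against the reference documentation. The pattern labels are hypothetical shorthand for the three patterns above.

```python
# Confidence levels copied from the credit-card-comprehensive entity above.
# The keys are hypothetical labels for its three patterns.
PATTERN_CONFIDENCE = {"formatted": 95, "regex": 80, "number": 70}
RECOMMENDED_CONFIDENCE = 80

def entity_confidence(matched_patterns):
    """Return the highest confidence among the matched patterns, or None."""
    levels = [PATTERN_CONFIDENCE[name] for name in matched_patterns]
    return max(levels) if levels else None

def meets_recommended(matched_patterns):
    """Return True when the entity's confidence reaches the recommended level."""
    confidence = entity_confidence(matched_patterns)
    return confidence is not None and confidence >= RECOMMENDED_CONFIDENCE

print(meets_recommended(["number"]))           # False: 70 is below 80
print(meets_recommended(["regex", "number"]))  # True: highest level is 80
```

Under these assumptions, the lowest tier alone never reaches the recommended confidence. It is only useful to consumers that apply a lower threshold of their own.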
Contextual Tuning
Adjust rules based on content context:
<!-- Rule for structured forms -->
<Entity id="form-ssn-detection" patternsProximity="100">
<Pattern confidenceLevel="90">
<IdMatch idRef="Func_ssn_formatted"/>
<Match idRef="form-keywords"/>
</Pattern>
</Entity>
<!-- Rule for unstructured documents -->
<Entity id="document-ssn-detection" patternsProximity="400">
<Pattern confidenceLevel="85">
<IdMatch idRef="Func_ssn_formatted"/>
<Any minMatches="2">
<Match idRef="personal-keywords"/>
<Match idRef="government-keywords"/>
<Match idRef="identity-keywords"/>
</Any>
</Pattern>
</Entity>
Language-Specific Tuning
Create localized versions for different languages:
<Keyword id="financial-terms-multilingual">
<Group matchStyle="word" langcode="en">
<Term>account</Term>
<Term>payment</Term>
<Term>balance</Term>
</Group>
<Group matchStyle="word" langcode="es">
<Term>cuenta</Term>
<Term>pago</Term>
<Term>saldo</Term>
</Group>
<Group matchStyle="word" langcode="fr">
<Term>compte</Term>
<Term>paiement</Term>
<Term>solde</Term>
</Group>
</Keyword>
Performance Optimization
Pattern Optimization
- Use Anchored Regex: Include word boundaries
<!-- Good: Uses word boundaries -->
<Pattern>\b\d{3}-\d{2}-\d{4}\b</Pattern>
<!-- Avoid: No anchoring -->
<Pattern>\d{3}-\d{2}-\d{4}</Pattern>
- Avoid Unnecessary Captures: Use non-capturing groups when the captured text is not needed
<!-- Good: Non-capturing group -->
<Pattern>\b(?:\d{4}[-\s]?){3}\d{4}\b</Pattern>
<!-- Avoid: Capturing group -->
<Pattern>\b(\d{4}[-\s]?){3}\d{4}\b</Pattern>
- Optimize Quantifiers: Be specific about repetition
<!-- Good: Specific repetition -->
<Pattern>\b\d{3}-\d{2}-\d{4}\b</Pattern>
<!-- Avoid: Unbounded quantifiers -->
<Pattern>\b\d+-\d+-\d+\b</Pattern>
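Word boundaries affect accuracy as well as speed. The sketch below runs the anchored and unanchored patterns from the first item against a made-up tracking code, using Python's re module as a stand-in for the engine's regex support (an assumption: the dialects may differ in details).

```python
import re

ANCHORED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
UNANCHORED = re.compile(r"\d{3}-\d{2}-\d{4}")

# Made-up tracking code that happens to contain an SSN-shaped fragment.
text = "Tracking code 9123-45-67890 shipped"

print(UNANCHORED.findall(text))  # ['123-45-6789'] (fragment of a longer number)
print(ANCHORED.findall(text))    # []
```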
Rule Ordering
Order patterns by selectivity (most specific first):
<Entity id="optimized-rule">
<!-- Most selective pattern first -->
<Pattern confidenceLevel="95">
<IdMatch idRef="highly-specific-pattern"/>
</Pattern>
<!-- Less selective patterns follow -->
<Pattern confidenceLevel="80">
<IdMatch idRef="moderately-specific-pattern"/>
<Match idRef="context-keywords"/>
</Pattern>
<!-- Least selective pattern last -->
<Pattern confidenceLevel="65">
<IdMatch idRef="broad-pattern"/>
<Any minMatches="3">
<Match idRef="context1"/>
<Match idRef="context2"/>
<Match idRef="context3"/>
</Any>
</Pattern>
</Entity>
Resource Management
- Limit Keyword Lists: Keep lists under 1000 terms
- Share Resources: Reuse common patterns and keywords
- Optimize Proximity: Use smallest effective values
- Monitor Memory: Track resource usage during testing
Troubleshooting Common Issues
Issue: Rule Not Matching Expected Content
Diagnosis Steps:
- Verify pattern syntax with regex testing tools
- Check keyword spelling and case sensitivity
- Validate proximity settings are appropriate
- Ensure context elements are present in test content
Solutions:
- Test patterns in isolation
- Add debug logging to identify failure points
- Review evaluation context for missing elements
- Adjust proximity values incrementally
Issue: Too Many False Positives
Diagnosis Steps:
- Analyze false positive samples
- Identify common characteristics
- Review confidence thresholds
- Check for missing exclusion patterns
Solutions:
- Add exclusion keywords for common false positives
- Increase confidence requirements
- Add more specific context requirements
- Reduce proximity values
Issue: Performance Problems
Diagnosis Steps:
- Profile rule execution times
- Identify slow patterns
- Check for regex backtracking
- Monitor resource usage
Solutions:
- Optimize regex patterns
- Reduce pattern complexity
- Order patterns by performance
- Consider rule splitting
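One way to find slow patterns outside the engine is to time each regex against a large sample. The sketch below uses Python's timeit module. The helper time_patterns and the filler text are hypothetical, and timings from Python's regex engine only approximate what the Content Identification engine will do.

```python
import re
import timeit

def time_patterns(patterns, text, repeat=5):
    """Return the best-of-`repeat` scan time in seconds for each pattern."""
    timings = {}
    for name, pattern in patterns.items():
        compiled = re.compile(pattern)
        runs = timeit.repeat(lambda: compiled.findall(text), number=1, repeat=repeat)
        timings[name] = min(runs)
    return timings

# Patterns taken from the Optimize Quantifiers example.
patterns = {
    "specific": r"\b\d{3}-\d{2}-\d{4}\b",
    "unbounded": r"\b\d+-\d+-\d+\b",
}
# Hypothetical large input: filler text with one embedded identifier.
text = ("lorem ipsum 12345 dolor " * 20000) + "123-45-6789"

# List the slowest pattern first.
results = time_patterns(patterns, text)
for name, seconds in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10} {seconds * 1000:.2f} ms")
```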
Deployment and Monitoring
Staged Deployment
- Development: Test with sample content
- Staging: Test with production-like data
- Limited Production: Deploy to subset of content
- Full Production: Deploy to all content streams
Monitoring Metrics
Track these key metrics in production:
- Match Rate: Detections per unit of content
- Confidence Distribution: Spread of confidence levels
- Performance: Processing time per rule
- False Positive Rate: User-reported incorrect matches
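The confidence distribution can be summarized with a simple histogram. The sketch below assumes that detections are available as records with a numeric confidence field. That record shape and the function name confidence_distribution are hypothetical, so adapt them to whatever your reporting export provides.

```python
from collections import Counter

def confidence_distribution(detections, bucket_size=10):
    """Count detections per confidence bucket (for example 80-89)."""
    buckets = Counter()
    for detection in detections:
        low = (detection["confidence"] // bucket_size) * bucket_size
        buckets[f"{low}-{low + bucket_size - 1}"] += 1
    return dict(buckets)

# Hypothetical detection records.
detections = [{"confidence": 85}, {"confidence": 80}, {"confidence": 95}]
print(confidence_distribution(detections))  # {'80-89': 2, '90-99': 1}
```

A distribution that drifts toward the lowest bucket over time can indicate that matches increasingly rely on the weakest pattern and are worth sampling for false positives.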
Continuous Improvement
- Regular Review: Assess rule performance monthly
- Feedback Integration: Incorporate user feedback
- Pattern Updates: Keep patterns current with new threats
- Performance Monitoring: Watch for degradation over time
Best Practices Summary
Rule Design
- Start simple and add complexity gradually
- Use multiple patterns with different confidence levels
- Include comprehensive context keywords
- Test with diverse, representative content
Performance
- Optimize regex patterns for efficiency
- Order patterns by selectivity
- Use appropriate proximity values
- Monitor resource usage
Maintenance
- Version control all rule changes
- Document rule logic and intent
- Review rule performance regularly
- Keep keyword lists current
Testing
- Create comprehensive test suites
- Include positive, negative, and edge cases
- Measure accuracy with precision/recall
- Test performance with large content samples